Avoiding Boosting Overfitting by Removing Confusing Samples

Authors

Alexander Vezhnevets

Olga Barinova

Abstract

Boosting methods are known to overfit noticeably on some datasets while being immune to overfitting on others. In this paper we show that standard boosting algorithms are not appropriate when classes overlap. This inadequacy is likely to be the major source of boosting overfitting on real-world data. To verify our conclusion we use the fact that any task with overlapping classes can be reduced to a deterministic task with the same Bayesian separating surface. This can be done by removing "confusing samples", that is, samples that are misclassified by a "perfect" Bayesian classifier. We propose an algorithm for removing confusing samples and experimentally study the behavior of AdaBoost trained on the resulting data sets. Experiments confirm that removing confusing samples helps boosting to reduce the generalization error and to avoid overfitting on both synthetic and real-world data. The process of removing confusing samples also provides an accurate error prediction based only on the training sets.
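
The reduction described in the abstract can be illustrated on a toy problem where the Bayes-optimal rule is known in closed form. The sketch below is not the authors' algorithm, which the abstract does not spell out; it assumes two one-dimensional Gaussian classes with means -1 and +1, unit variance, and equal priors, so the Bayes rule is a threshold at 0. The function names are hypothetical.

```python
import random

def make_overlapping_data(n, seed=0):
    """Draw n labelled points from two overlapping 1-D Gaussians.

    Class -1 is N(-1, 1) and class +1 is N(+1, 1) with equal priors.
    These distributions are assumptions made for this illustration.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice([-1, 1])
        x = rng.gauss(label, 1.0)  # class mean equals the label, unit variance
        data.append((x, label))
    return data

def bayes_predict(x):
    """Bayes-optimal rule for this toy problem: threshold at 0."""
    return 1 if x >= 0 else -1

def remove_confusing(data):
    """Keep only the samples that the Bayes rule classifies correctly."""
    return [(x, y) for x, y in data if bayes_predict(x) == y]

data = make_overlapping_data(1000)
clean = remove_confusing(data)
removed_fraction = 1 - len(clean) / len(data)
print(f"removed {removed_fraction:.1%} of the samples as confusing")
```

After filtering, every remaining sample lies on the correct side of the threshold, so the task is deterministic while the separating surface is unchanged. The removed fraction is an empirical estimate of the Bayes error, which for this assumed setup is about 15.9%. On real data the Bayes rule is unknown and has to be estimated, which is the part the paper's algorithm addresses.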

Similar resources

Boosting-like Deep Learning For Pedestrian Detection

This paper proposes boosting-like deep learning (BDL) framework for pedestrian detection. Due to overtraining on the limited training samples, overfitting is a major problem of deep learning. We incorporate a boosting-like technique into deep learning to weigh the training samples, and thus prevent overtraining in the iterative process. We theoretically give the details of derivation of our alg...

Feature Selection for Descriptor Based Classification Models. 1. Theory and GA-SEC Algorithm

The paper describes different aspects of classification models based on molecular data sets with the focus on feature selection methods. Especially model quality and avoiding a high variance on unseen data (overfitting) will be discussed with respect to the feature selection problem. We present several standard approaches and modifications of our Genetic Algorithm based on the Shannon Entropy C...

Outlier Detection by Boosting Regression Trees

A procedure for detecting outliers in regression problems is proposed. It is based on information provided by boosting regression trees. The key idea is to select the most frequently resampled observation along the boosting iterations and reiterate after removing it. The selection criterion is based on Tchebychev’s inequality applied to the maximum over the boosting iterations of ...

Boosting by weighting boundary and erroneous samples

This paper shows that new and flexible criteria to resample populations in boosting algorithms can lead to performance improvements. Real Adaboost emphasis function can be divided into two different terms, the first only pays attention to the quadratic error of each pattern and the second takes only into account the “proximity” of each pattern to the boundary. Here, we incorporate an additional...

PDC-SGB: Prediction of effective drug combinations using a stochastic gradient boosting algorithm.

Combinatorial therapy is a promising strategy for combating complex diseases by improving the efficacy and reducing the side effects. To facilitate the identification of drug combinations in pharmacology, we proposed a new computational model, termed PDC-SGB, to predict effective drug combinations by integrating biological, chemical and pharmacological information based on a stochastic gradient...

Publication date: 2007
